This RMarkdown file contains the report of the data analysis done for the project on forecasting daily bike rental demand using time series models in R. It contains analysis such as data exploration, summary statistics and building the time series models. The final report was completed on Sun Jun 15 05:10:09 2025.
Data Description:
This dataset contains the daily count of rental bike transactions between years 2011 and 2012 in Capital bikeshare system with the corresponding weather and seasonal information.
Data Source: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
Relevant Paper:
Fanaee-T, Hadi, and Gama, Joao. Event labeling combining ensemble detectors and background knowledge, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg
# Set CRAN mirror
options(repos = c(CRAN = "https://cloud.r-project.org"))
#Install Necessary Packages
install.packages("timetk")
## package 'timetk' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Mash\AppData\Local\Temp\Rtmp2XZLGG\downloaded_packages
install.packages("tidyverse")
## package 'tidyverse' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Mash\AppData\Local\Temp\Rtmp2XZLGG\downloaded_packages
install.packages("lubridate")
## package 'lubridate' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Mash\AppData\Local\Temp\Rtmp2XZLGG\downloaded_packages
install.packages("GGally")
## package 'GGally' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Mash\AppData\Local\Temp\Rtmp2XZLGG\downloaded_packages
install.packages("ggthemes")
## package 'ggthemes' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Mash\AppData\Local\Temp\Rtmp2XZLGG\downloaded_packages
#load Packages
library(timetk)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggthemes)
library(dplyr)
#Load Dataset
data("bike_sharing_daily")
bike_data <- bike_sharing_daily
View(bike_data)
#Exploratory Analysis
summary(bike_data)
## instant dteday season yr
## Min. : 1.0 Min. :2011-01-01 Min. :1.000 Min. :0.0000
## 1st Qu.:183.5 1st Qu.:2011-07-02 1st Qu.:2.000 1st Qu.:0.0000
## Median :366.0 Median :2012-01-01 Median :3.000 Median :1.0000
## Mean :366.0 Mean :2012-01-01 Mean :2.497 Mean :0.5007
## 3rd Qu.:548.5 3rd Qu.:2012-07-01 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :731.0 Max. :2012-12-31 Max. :4.000 Max. :1.0000
## mnth holiday weekday workingday
## Min. : 1.00 Min. :0.00000 Min. :0.000 Min. :0.000
## 1st Qu.: 4.00 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.000
## Median : 7.00 Median :0.00000 Median :3.000 Median :1.000
## Mean : 6.52 Mean :0.02873 Mean :2.997 Mean :0.684
## 3rd Qu.:10.00 3rd Qu.:0.00000 3rd Qu.:5.000 3rd Qu.:1.000
## Max. :12.00 Max. :1.00000 Max. :6.000 Max. :1.000
## weathersit temp atemp hum
## Min. :1.000 Min. :0.05913 Min. :0.07907 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:0.33708 1st Qu.:0.33784 1st Qu.:0.5200
## Median :1.000 Median :0.49833 Median :0.48673 Median :0.6267
## Mean :1.395 Mean :0.49538 Mean :0.47435 Mean :0.6279
## 3rd Qu.:2.000 3rd Qu.:0.65542 3rd Qu.:0.60860 3rd Qu.:0.7302
## Max. :3.000 Max. :0.86167 Max. :0.84090 Max. :0.9725
## windspeed casual registered cnt
## Min. :0.02239 Min. : 2.0 Min. : 20 Min. : 22
## 1st Qu.:0.13495 1st Qu.: 315.5 1st Qu.:2497 1st Qu.:3152
## Median :0.18097 Median : 713.0 Median :3662 Median :4548
## Mean :0.19049 Mean : 848.2 Mean :3656 Mean :4504
## 3rd Qu.:0.23321 3rd Qu.:1096.0 3rd Qu.:4776 3rd Qu.:5956
## Max. :0.50746 Max. :3410.0 Max. :6946 Max. :8714
glimpse(bike_data)
## Rows: 731
## Columns: 16
## $ instant <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ dteday <date> 2011-01-01, 2011-01-02, 2011-01-03, 2011-01-04, 2011-01-05…
## $ season <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ yr <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mnth <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ holiday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ weekday <dbl> 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4,…
## $ workingday <dbl> 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1,…
## $ weathersit <dbl> 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 2, 2,…
## $ temp <dbl> 0.3441670, 0.3634780, 0.1963640, 0.2000000, 0.2269570, 0.20…
## $ atemp <dbl> 0.3636250, 0.3537390, 0.1894050, 0.2121220, 0.2292700, 0.23…
## $ hum <dbl> 0.805833, 0.696087, 0.437273, 0.590435, 0.436957, 0.518261,…
## $ windspeed <dbl> 0.1604460, 0.2485390, 0.2483090, 0.1602960, 0.1869000, 0.08…
## $ casual <dbl> 331, 131, 120, 108, 82, 88, 148, 68, 54, 41, 43, 25, 38, 54…
## $ registered <dbl> 654, 670, 1229, 1454, 1518, 1518, 1362, 891, 768, 1280, 122…
## $ cnt <dbl> 985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822, 1321, 126…
str(bike_data)
## spc_tbl_ [731 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ instant : num [1:731] 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : Date[1:731], format: "2011-01-01" "2011-01-02" ...
## $ season : num [1:731] 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : num [1:731] 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : num [1:731] 1 1 1 1 1 1 1 1 1 1 ...
## $ holiday : num [1:731] 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : num [1:731] 6 0 1 2 3 4 5 6 0 1 ...
## $ workingday: num [1:731] 0 0 1 1 1 1 1 0 0 1 ...
## $ weathersit: num [1:731] 2 2 1 1 1 1 2 2 1 1 ...
## $ temp : num [1:731] 0.344 0.363 0.196 0.2 0.227 ...
## $ atemp : num [1:731] 0.364 0.354 0.189 0.212 0.229 ...
## $ hum : num [1:731] 0.806 0.696 0.437 0.59 0.437 ...
## $ windspeed : num [1:731] 0.16 0.249 0.248 0.16 0.187 ...
## $ casual : num [1:731] 331 131 120 108 82 88 148 68 54 41 ...
## $ registered: num [1:731] 654 670 1229 1454 1518 ...
## $ cnt : num [1:731] 985 801 1349 1562 1600 ...
## - attr(*, "spec")=
## .. cols(
## .. instant = col_double(),
## .. dteday = col_date(format = ""),
## .. season = col_double(),
## .. yr = col_double(),
## .. mnth = col_double(),
## .. holiday = col_double(),
## .. weekday = col_double(),
## .. workingday = col_double(),
## .. weathersit = col_double(),
## .. temp = col_double(),
## .. atemp = col_double(),
## .. hum = col_double(),
## .. windspeed = col_double(),
## .. casual = col_double(),
## .. registered = col_double(),
## .. cnt = col_double()
## .. )
#Correlation between Temperature and Total Rental
cor(bike_data$temp, bike_data$cnt)
## [1] 0.627494
cor(bike_data$atemp, bike_data$cnt)
## [1] 0.6310657
#Correlation between Temperature and Casual/Registered User
cor(bike_data$temp, bike_data$registered)
## [1] 0.540012
cor(bike_data$temp, bike_data$casual)
## [1] 0.5432847
#Extract Date Components
bike_data <- bike_sharing_daily %>%
mutate(
date = as.Date(dteday), # Create a new proper date column
month = month(date, label = TRUE),
year = year(date),
weekday = wday(date, label = TRUE),
weekend = ifelse(weekday %in% c("Sat", "Sun"), "Weekend", "Weekday")
)
#Mean and Median Temperature by Season
bike_data %>%
group_by(season) %>%
summarise(mean_temp=mean(temp), median_temp=median(temp))
## # A tibble: 4 × 3
## season mean_temp median_temp
## <dbl> <dbl> <dbl>
## 1 1 0.298 0.286
## 2 2 0.544 0.562
## 3 3 0.706 0.715
## 4 4 0.423 0.409
#Monthly Summary of Temperature, Humidity, Windspeed, and Total Rentals
bike_data %>%
mutate(month=month(date, label=TRUE)) %>%
group_by(month) %>%
summarise(across(c(temp,atemp,hum,windspeed,cnt), mean, .names="mean_{col}"))
## # A tibble: 12 × 6
## month mean_temp mean_atemp mean_hum mean_windspeed mean_cnt
## <ord> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Jan 0.236 0.240 0.586 0.206 2176.
## 2 Feb 0.299 0.300 0.567 0.216 2655.
## 3 Mar 0.391 0.382 0.588 0.223 3692.
## 4 Apr 0.470 0.457 0.588 0.234 4485.
## 5 May 0.595 0.566 0.689 0.183 5350.
## 6 Jun 0.684 0.638 0.576 0.185 5772.
## 7 Jul 0.755 0.704 0.598 0.166 5564.
## 8 Aug 0.709 0.651 0.638 0.173 5664.
## 9 Sep 0.616 0.579 0.715 0.166 5767.
## 10 Oct 0.485 0.472 0.694 0.175 5199.
## 11 Nov 0.369 0.367 0.625 0.184 4247.
## 12 Dec 0.324 0.326 0.666 0.177 3404.
#Box Plot of Temperature by Season
boxplot(temp~season, data=bike_data,main="Temperature by Season", col="skyblue")
#Correlation Matrix of Continuous Variables
bike_data %>%
select(temp,atemp,hum,windspeed,casual,registered,cnt) %>%
ggpairs(title="Correlation Matrix of Key Variables")
#Time Series of Total Rentals
bike_data %>%
plot_time_series(date,cnt, .interactive=TRUE, .title="Daily Bike Rental Over Time")
#Registered vs Casual Over Time
bike_data %>%
select(date, registered,casual) %>%
pivot_longer(col=-date, names_to="type", values_to="rentals") %>%
plot_time_series(date, rentals, .color_var=type, .interactive=TRUE, .title="Registered vs Casual Users")
#Seasonal Diagonostics
bike_data %>%
plot_seasonal_diagnostics(date, cnt)
#Anomaly Detection
bike_data %>%
plot_anomaly_diagnostics(date,cnt)
## frequency = 7 observations per 1 week
## trend = 92 observations per 3 months
#Load Packages
install.packages("forecast")
## package 'forecast' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Mash\AppData\Local\Temp\Rtmp2XZLGG\downloaded_packages
install.packages("TTR")
## package 'TTR' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Mash\AppData\Local\Temp\Rtmp2XZLGG\downloaded_packages
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(TTR)
#Convert to Time Series Object
ts_data <- ts(bike_data$cnt, frequency=365)
ts_clean <- tsclean(ts_data)
#Plot Original vs Clean Data
plot(ts_data, col="red", main="Original vs Cleaned Time Series")
lines(ts_clean, col="blue")
#Simple Exponential Smoothing
ses_model <- HoltWinters(ts_clean, beta=FALSE, gamma=FALSE)
plot(ses_model, main="Simple Exponential Smoothing")
#Simple Moving Average with Order 10
sma_10 <- SMA(ts_clean, n=10)
plot.ts(sma_10, main="10-Day Moving Average", col="darkgreen")
#Load Test for Stationary
install.packages("tseries")
## package 'tseries' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'tseries'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying
## C:\Users\Mash\AppData\Local\Programs\R\R-4.5.0\library\00LOCK\tseries\libs\x64\tseries.dll
## to
## C:\Users\Mash\AppData\Local\Programs\R\R-4.5.0\library\tseries\libs\x64\tseries.dll:
## Permission denied
## Warning: restored 'tseries'
##
## The downloaded binary packages are in
## C:\Users\Mash\AppData\Local\Temp\Rtmp2XZLGG\downloaded_packages
library(tseries)
#Decompose Time Series
decompose <- decompose(ts_clean)
plot(decompose)
#Remove Seasonality
adjusted_ts <- ts_clean - decompose$seasonal
plot(adjusted_ts, main="Seasonally Adjusted Series")
adf.test(adjusted_ts)
##
## Augmented Dickey-Fuller Test
##
## data: adjusted_ts
## Dickey-Fuller = -3.9571, Lag order = 9, p-value = 0.01114
## alternative hypothesis: stationary
#Plot ACF and PACF
acf(adjusted_ts)
pacf(adjusted_ts)
#Differencing if Non-Stationary
diff_ts <- diff(adjusted_ts)
plot(diff_ts, main="Differenced Series")
#Auto ARIMA M
auto_model <- auto.arima(ts_clean)
summary(auto_model)
## Series: ts_clean
## ARIMA(1,0,3)(0,1,0)[365] with drift
##
## Coefficients:
## ar1 ma1 ma2 ma3 drift
## 0.9683 -0.5912 -0.1279 -0.0937 5.7116
## s.e. 0.0224 0.0571 0.0617 0.0576 0.8318
##
## sigma^2 = 986021: log likelihood = -3042.81
## AIC=6097.63 AICc=6097.86 BIC=6121.05
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 5.85301 697.8113 385.8648 -2.699882 9.189324 0.1694626
## ACF1
## Training set -0.003587803
#Manual ARIMA Example (Can be Optimised)
manual_model <- arima(ts_clean, order=c(1,1,1))
summary(manual_model)
##
## Call:
## arima(x = ts_clean, order = c(1, 1, 1))
##
## Coefficients:
## ar1 ma1
## 0.3692 -0.8751
## s.e. 0.0437 0.0213
##
## sigma^2 estimated as 661725: log likelihood = -5928.18, aic = 11862.37
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 11.13247 812.9082 588.1202 -6.429274 19.42943 0.894064 0.01109983
#Residual Diagnostic
shapiro.test(residuals(auto_model))
##
## Shapiro-Wilk normality test
##
## data: residuals(auto_model)
## W = 0.83801, p-value < 2.2e-16
acf(residuals(auto_model))
pacf(residuals(auto_model))
#Forecast Next 25 Days
forecast_auto <- forecast(auto_model, h=25)
forecast_manual <- forecast(manual_model, h=25)
#Plot Forecast
plot(forecast_auto, main="Forecast with Auto ARIMA")
plot(forecast_manual, main="Forecast with Manual ARIMA")
#Visualise Patterns Across Seasons and Weekdays
bike_data %>%
group_by(season, weekday) %>%
summarise(avg_count = mean(cnt)) %>%
ggplot(aes(x = weekday, y = season, fill = avg_count)) +
geom_tile() +
scale_fill_viridis_c() +
labs(title = "Heatmap: Average Rentals by Weekday and Season",
x = "Weekday", y = "Season", fill = "Avg Rentals") +
theme_minimal()
## `summarise()` has grouped output by 'season'. You can override using the
## `.groups` argument.
install.packages("ggthemes")
## Warning: package 'ggthemes' is in use and will not be installed
library(ggthemes)
bike_data %>%
ggplot(aes(x = date, y = cnt)) +
geom_line(color = "steelblue", size = 1) +
labs(title = "Bike Rentals Over Time",
subtitle = "Capital Bikeshare Program (2011–2012)",
x = "Date", y = "Rental Count") +
theme_economist()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
This time series forecasting project successfully modeled the daily bike rental demand for the Capital Bikeshare system. The rigorous data cleaning, decomposition, and stationarity testing ensured a stable time series structure conducive to ARIMA modeling. The use of both manual and automated ARIMA models enabled comparative diagnostics, where Auto ARIMA was preferred due to its lower error metrics and better residual properties.
Overall, this analysis demonstrates the feasibility of data-driven demand forecasting in urban mobility. It highlights how weather and calendar features critically influence public transportation usage. These insights are highly valuable for urban planners and operational teams aiming to optimize bike distribution, plan fleet expansion, or build adaptive pricing strategies.
The project not only meets the goals of forecasting but also illustrates the importance of statistical rigor in transforming raw data into actionable intelligence.